Abstract

Video segmentation is a stepping stone to understanding video context.
It enables one to represent a video by decomposing it into coherent
regions that comprise whole objects or parts of objects. However, the
challenge originates from the fact that most video segmentation
algorithms are based on unsupervised learning, owing to the high cost of
pixelwise video annotation and the intra-class variability within
similar unconstrained video classes. We propose a Markov Random Field
model for unconstrained video segmentation that relies on tight
integration of multiple cues: vertices are defined from contour-based
superpixels, unary potentials from temporally smooth label likelihood,
and pairwise potentials from the global structure of a video. The
multi-cue structure is key to extracting coherent object regions from
unconstrained videos in the absence of supervision. Our experiments on
the VSB100 dataset show that the proposed model significantly
outperforms competing state-of-the-art algorithms. Qualitative analysis
illustrates that the video segmentation results of the proposed model
are consistent with human perception of objects.

1. Introduction 

Video segmentation is one of the important problems in video
understanding. A video may contain a set of objects, from stationary
ones to those undergoing dependent or independent motion. Humans
understand a video by recognizing objects and infer the video context
(i.e., what is happening in the video) by observing their motion.
Depending on the video context, parts of objects or whole objects will
have structured motion correlation. However, there may be unrelated
entities such as background or auxiliary objects which form additional
structures as well. A holistic representation of a video cannot
effectively decompose and extract meaningful structure, and it may
increase the intra-class variability of a video class. The goal of video
segmentation is to obtain coherent object regions over frames so that a
video can be represented as a set of objects and a meaningful structure
can be extracted.


Ideally, the ultimate goal of video segmentation is to obtain pixelwise
semantic segmentation of videos, where the objective is not only to
partition a video into object regions but also to infer the object label
of each region. Semantic segmentation is actively investigated in urban
driving scene understanding [6], [19]. However, urban scene videos
contain rigid objects such as buildings, cars or roads with typically
smooth motion. It is more challenging to segment and classify object
regions in general, unconstrained videos. First, the labor cost of
obtaining pixelwise label annotation in video can be extremely high.
Instead, most datasets provide bounding box annotations on major objects
without full frame coverage. In addition, typical video datasets display
high intra-class variability. Objects and human subjects are deformable,
and their appearance changes with illumination over frames. Furthermore,
the motion patterns of objects in the same video class may be
idiosyncratic. Because of these aspects, learning a robust classifier
for each and every object in a video remains, at present, an
insurmountable task.

Another fundamental challenge in video segmentation is that the inherent
video object hierarchy may be highly subjective. Annotations by multiple
human annotators may vary significantly. For example, one annotator may
assign a single label to the whole human body, whereas another annotator
may label the torso and legs separately. Furthermore, some objects may
not correlate strongly with any single feature. For example, an object
may have parts that show different color patterns but move consistently.
Hence, in practice, one may induce a hierarchical video segmentation
with different levels of granularity from the aggregated information of
multi-cue feature channels.

In this paper, we propose a novel hierarchical video segmentation model
which integrates temporally smooth labels with global structure
consistency while preserving object boundaries. Our contributions are as
follows:

• We propose a video segmentation model that preserves the multi-cue
structures of object boundary and temporally smooth labels with global
spatio-temporal consistency.

• We propose an effective pairwise potential to represent
spatio-temporal structure, evaluated on object boundary, color, optical
flow, texture and long trajectory correspondence.

• Video hierarchy is inferred through the process of graph edge
consistency, which generalizes traditional hierarchy induction
approaches.

• The proposed method infers precise coarse-grained segmentation, where
a segment may represent one whole object.

Figure 1: Overview of the framework. (a) Node potential depends on the
histogram of temporally smooth pixelwise labels of the corresponding
frame. Spatial edge potentials: (b) gray intensity represents contour
strength, (c) RGB color is displayed for better visualization, (d) color
represents motion direction, (e) color represents the visual word
identity of each dense SIFT feature. Temporal edge potential depends on
the correspondence ratio on long trajectories and on color affinity.
(f) Superpixels for the corresponding vertices in frame f are
illustrated by object contours. For visualization purposes, coarse-grained
superpixels are shown. Best viewed in color.

The remainder of this paper is organized as follows. Section 2 describes
related work and its limitations. Our proposed model is introduced in
Section 3. The experimental setup and results are described in
Section 4, followed by concluding remarks in Section 5.

2. Related Work 

One of the main objectives of video segmentation is to obtain
spatio-temporal smoothness of the region labels. Grundmann et al. [10]
proposed a greedy agglomerative clustering algorithm that merges two
adjacent superpixels if their color difference is smaller than the
internal variance of each superpixel. The granularity of the
segmentation is controlled by adding a parameter to the internal
variance. The algorithm obtains spatio-temporal smoothness of segment
labels since it merges adjacent superpixels. In addition, it effectively
detects newly appearing objects thanks to the agglomerative clustering.
However, it relies only on color information and ignores
spatio-temporal structure. As a consequence, it may merge a part of an
object with another object or with the background, especially in
coarse-grained segmentation. Furthermore, the approach does not extract
object boundaries effectively because the algorithm does not make use of
spatial structure from image gradients or edge detectors.

Object boundary contours capture spatial structure in image data.
Arbelaez et al. [1] introduced a hierarchical contour detector for image
segmentation. Their framework starts with the best angular edge response
at each pixel, and agglomerative clustering then constructs a
hierarchical object contour map. It is capable of detecting object
boundaries even in a low-contrast image where the object appearance is
barely distinguishable from the background.

Moreover, the contour strength provides a cue for understanding spatial
structure. It is likely that a strong contour separates an object from
other objects, while a weak contour separates two parts inside an
object. However, the algorithm is applicable only to image data and it
is not trivial to extend it to video. The algorithm processes each video
frame independently and produces object regions within each image.
Regions that correspond to the same object must then be matched across
frames to obtain temporal smoothness of the segmentation.

Galasso et al. [9] aim to obtain correspondence of superpixels across
video frames by propagating labels from a source frame along the optical
flow. However, the quality of the propagated labels typically decays as
the distance from the source frame increases, owing to flow estimation
errors. They mitigate this by propagating from the center frame, but
this does not take into account global label consistency over the full
video sequence. Another limitation is that this label propagation
approach cannot introduce new objects, because the label set of the
source frame does not contain a label that corresponds to the new
object. In motion segmentation, Elqursh and Elgammal resolve the issue
by splitting a group of trajectories if their dissimilarity becomes
dominant. However, the robustness of this approach depends highly on the
choice of a threshold parameter, which needs to be tuned for each video.

On the other hand, robust temporal structure information can be
extracted from long-term trajectories. Ochs et al. introduce a video
segmentation framework that depends on long-term point trajectories from
large displacement optical flow. They start with spatially sparse
trajectory labels, which are obtained by regularized spectral clustering
on the motion difference among trajectories. Dense region labels are
inferred by Potts energy minimization. Although this approach attains
robust temporal consistency, it cannot distinguish objects with
identical motion patterns because the trajectory label depends only on
motion. Nonetheless, long trajectories offer a good cue for inferring
long-range temporal structure in a video. For instance, two superpixels
in distant frames can be hypothesized to have a common identity if they
share sufficiently many pixel trajectories.

Galasso et al. aggregate a set of pairwise affinities in color, optical
flow direction, long trajectory correspondence and adjacent object
boundary. With the aggregated pairwise affinity, they adopt spectral
clustering to infer segment labels. Spectral clustering is one of the
standard algorithms for the segmentation problem. However, Nadler and
Galun illustrate cases where spectral clustering fails when the dataset
contains structures at different scales of size and density for
different clusters.

We propose a Markov Random Field (MRF) model whose vertices are defined
from object-contour-based superpixels. The model takes temporally smooth
label likelihood as node potentials, and global spatio-temporal
structure information is incorporated as edge potentials in multi-modal
feature channels, such as color, motion, object boundary, texture and
long trajectories. Since the proposed model takes contour-based
superpixels as vertices, the inferred segmentation preserves good object
boundaries. In addition, the model improves long-range temporal
consistency over label propagation by incorporating global structure.
Moreover, we aggregate multi-modal features in the video so that the
model can distinguish objects with identical motion. Finally, MRF
inference with unary and pairwise potentials results in more accurate
segmentation than spectral clustering, which relies only on pairwise
relationships.

As a result, the proposed model infers video segmentation labels that
preserve accurate object boundaries, are locally smooth, and are
consistent with the global spatio-temporal structure of the video.

3. Proposed Model 

3.1. Multi-Cue Structure Preserving MRF Model 

An overview of our framework for video segmentation is depicted in
Figure 1. A video is represented as a graph
$\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where the vertex set
$\mathcal{V} = \{\mathcal{V}^1, \dots, \mathcal{V}^F\}$ is defined on
contour-based superpixels from all frames $f \in \{1, \dots, F\}$ in the
video. For each frame, an object contour map is obtained from the
contour detector [1]. A region enclosed by a contour forms a superpixel.
The edge set $\mathcal{E} = \{e_{ij}\}$ describes the relationship for
each pair of vertices. It consists of spatial edges
$e_{ij} \in \mathcal{E}^s$, where $i, j \in \mathcal{V}^f$, and temporal
edges $e_{ij} \in \mathcal{E}^t$, where $i \in \mathcal{V}^f$,
$j \in \mathcal{V}^{f'}$ and $f \neq f'$.

Video segmentation is obtained by MAP inference on a Markov Random Field
$Y = \{y_i \mid i \in \mathcal{V}\}$ on this graph $\mathcal{G}$, where
$P(Y) = \frac{1}{Z} \exp(-E(Y))$ and $Z$ is the partition function.
Vertex $i$ is labeled as $y_i$ from the label set $\mathcal{L}$ of size
$L$. MAP inference is equivalent to the following energy minimization
problem.


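$$\min_{p, q} \; E(Y) = \sum_{i \in \mathcal{V}} \phi_i \cdot p_i + \sum_{(i,j) \in \mathcal{E}} \psi_{ij} : q_{ij} \tag{1}$$

$$\text{s.t.} \quad \sum_{l \in \mathcal{L}} p_i(l) = 1, \quad \forall i \in \mathcal{V} \tag{2}$$

$$\sum_{l' \in \mathcal{L}} q_{ij}(l, l') = p_i(l), \quad \forall (i,j) \in \mathcal{E},\; l \in \mathcal{L} \tag{3}$$

$$p_i \in \{0, 1\}^{L}, \quad \forall i \in \mathcal{V} \tag{4}$$

$$q_{ij} \in \{0, 1\}^{L \times L}, \quad \forall (i,j) \in \mathcal{E} \tag{5}$$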


In (1), $\phi_i$ represents the node potentials for a vertex
$i \in \mathcal{V}$ and $\psi_{ij}$ the edge potentials for an edge
$e_{ij} \in \mathcal{E}$. As with the edge set $\mathcal{E}$, edge
potentials are decomposed into spatial and temporal edge potentials,
$\psi = \{\psi^s, \psi^t\}$. The vector $p_i$ indicates the label $y_i$,
and $q_{ij}$ is the label pair indicator matrix for $y_i$ and $y_j$. The
operators $\cdot$ and $:$ denote the inner product and the Frobenius
product, respectively. Spatial edge potentials are defined for each edge
that connects vertices in the same frame, $i, j \in \mathcal{V}^f$. In
contrast, temporal edge potentials are defined for each pair of vertices
in different frames, $i \in \mathcal{V}^f$, $j \in \mathcal{V}^{f'}$,
$f \neq f'$. It is worth noting that the proposed model includes spatial
edges between two vertices that are not spatially adjacent and,
similarly, temporal edges are not limited to consecutive frames.
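As a minimal sketch of how the objective in (1) is evaluated for a given labeling (not the authors' implementation), the following Python snippet computes the energy with one-hot indicator vectors $p_i$ and indicator matrices $q_{ij}$ formed as the outer product of $p_i$ and $p_j$. The function names `one_hot`, `potts` and `mrf_energy`, the 0-based label indices and the toy potentials are assumptions introduced for illustration only.

```python
def one_hot(label, num_labels):
    """Indicator vector p_i for a 0-based label index."""
    return [1 if l == label else 0 for l in range(num_labels)]


def potts(cost, num_labels):
    """L x L Potts matrix: zero on the diagonal, `cost` elsewhere."""
    return [[0.0 if l == lp else cost for lp in range(num_labels)]
            for l in range(num_labels)]


def mrf_energy(labels, node_pot, edge_pot):
    """Energy of a labeling.

    labels:   list of label indices, one per vertex
    node_pot: node_pot[i][l] is the cost of giving vertex i label l
    edge_pot: dict {(i, j): L x L matrix} of pairwise costs
    """
    num_labels = len(node_pot[0])
    energy = 0.0
    for i, phi in enumerate(node_pot):
        p = one_hot(labels[i], num_labels)
        energy += sum(a * b for a, b in zip(phi, p))  # inner product
    for (i, j), psi in edge_pot.items():
        p_i = one_hot(labels[i], num_labels)
        p_j = one_hot(labels[j], num_labels)
        # Frobenius product of psi with q_ij = outer(p_i, p_j)
        energy += sum(psi[l][lp] * p_i[l] * p_j[lp]
                      for l in range(num_labels)
                      for lp in range(num_labels))
    return energy


# Toy example: two vertices, two labels, one edge with Potts cost 0.5.
toy_nodes = [[-0.9, -0.1], [-0.2, -0.8]]
toy_edges = {(0, 1): potts(0.5, 2)}
energy_split = mrf_energy([0, 1], toy_nodes, toy_edges)
energy_merged = mrf_energy([0, 0], toy_nodes, toy_edges)
```

In this toy case the split labeling still has lower energy than the merged one; raising the Potts cost would eventually make the merged labeling preferable, which is the mechanism by which pairwise potentials encourage equal labels.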

The vertices of the graph are defined from contour-based superpixels so
that the inferred region labels preserve accurate object boundaries.
Node potential parameters are obtained from temporally smooth label
likelihood. Edge potential parameters aggregate appearance and motion
features to represent the global spatio-temporal structure of the video.
MAP inference on the proposed Markov Random Field (MRF) model infers
region labels which preserve object boundaries, attain temporal
smoothness and are consistent with the global structure. Details are
described in the following sections.


3.2. Node Potentials 

Unary potential parameters $\phi_i \in \mathbb{R}^{L}$ represent the
cost of labeling vertex $i \in \mathcal{V}$ from the label set
$\mathcal{L}$. While edge potentials represent the global
spatio-temporal structure in a video, node potentials in the proposed
model strengthen temporal smoothness for label inference. The temporally
smooth label set $\mathcal{L}$ is obtained from a greedy agglomerative
clustering [10]. The clustering algorithm merges two adjacent blobs in a
video when their color difference is smaller than the variance of each
blob. Node potential parameters $\phi_i$ represent the labeling cost of
vertex $i$ as the negative label likelihood $h_i$.

$$\phi_i = -h_i \tag{6}$$

$$h_i = [h'_i(1), \dots, h'_i(L)] / H \tag{7}$$

$$H = \sum_{b=1}^{L} h'_i(b) \tag{8}$$


Each superpixel is evaluated by the pixelwise cluster labels from
$\mathcal{L}$, and the label histogram represents the label likelihood
for vertex $i$. As illustrated in Figure 1(a), a superpixel has a
mixture of pixelwise temporally smooth labels because the agglomerative
clustering [10] merges unstructured blobs. Let $h'_i(b)$ be the number
of pixels with temporally smooth label $b$ in the superpixel
corresponding to vertex $i$. As described in Section 3.1, a vertex is
defined on a superpixel which is enclosed by an object contour. Arbelaez
et al. [1] extract object contours such that taking different threshold
values on the contours produces different granularity levels of enclosed
regions. In our proposed model, we take a set of vertices
$\mathcal{V}^f$ from a video frame $f$ by a single threshold on
contours, which results in fine-grained superpixels.
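The node potential in (6) to (8) amounts to a normalized label histogram with its sign flipped. The following Python sketch illustrates this under the assumption that the pixelwise labels of one superpixel are available as a flat list of 0-based label indices; the function name `node_potential` is hypothetical and not taken from the paper.

```python
from collections import Counter


def node_potential(pixel_labels, num_labels):
    """Negative normalized label histogram of one superpixel.

    pixel_labels: temporally smooth cluster label of every pixel in the
                  superpixel (0-based indices; the paper indexes from 1)
    num_labels:   size L of the label set
    """
    if not pixel_labels:
        raise ValueError("superpixel contains no pixels")
    counts = Counter(pixel_labels)   # unnormalized counts h'_i(b)
    total = len(pixel_labels)        # normalizer H
    likelihood = [counts.get(b, 0) / total for b in range(num_labels)]
    return [-h for h in likelihood]  # phi_i = -h_i


# A superpixel with four pixels: two carry label 0, one each label 1 and 2.
phi_example = node_potential([0, 0, 1, 2], num_labels=3)
```

The label that covers most of the superpixel receives the lowest cost, so in the absence of edges the minimizer of (1) simply picks the dominant label of each superpixel.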


3.3. Spatial Edge Potentials 

Binary edge potential parameters $\psi$ consist of two different types:
spatial and temporal edge potentials, $\psi^s$ and $\psi^t$,
respectively. Spatial edge potentials $\psi^s_{ij}$ model the pairwise
relationship of two vertices $i$ and $j$ within a single video frame
$f$. We define these pairwise potentials as follows:




$$\psi^s_{ij}(l, l') = \begin{cases} \psi^s_{ij} & \text{if } l \neq l' \text{ and } \psi^s_{ij} > \tau \\ 0 & \text{otherwise} \end{cases} \tag{9}$$


A spatial edge potential parameter $\psi^s_{ij}(l, l')$ is the
$(l, l')$ element of the $L \times L$ matrix which represents the cost
of labeling a pair of vertices $i$ and $j$ as $l$ and $l'$,
respectively. It takes the form of a Potts energy, where all pairs of
different labels take the homogeneous cost $\psi^s_{ij}$. Spatial edge
potentials are decomposed into $\psi^b$, $\psi^c$, $\psi^o$ and
$\psi^x$, which represent pairwise potentials in the channels of object
boundary, color, optical flow direction and texture, respectively. The
pairwise cost of having different labels is high if the two vertices
$i$ and $j$ have high affinity in the corresponding channel. As a
result, edge potentials increase the likelihood of assigning the same
label to vertices $i$ and $j$ during energy minimization.

The edge potentials take equal weights on all channels. The importance
of each channel may depend on the video context, and different videos
have dissimilar contexts. Learning the weights of each channel is
challenging and prone to overfitting owing to the high variability of
video context and the limited number of labeled video samples in the
dataset. Hence, the proposed model weights all channels equally.

The model controls the granularity of segmentation by a threshold
$\tau$. In (9), the pairwise potential is thresholded by $\tau$. If
$\tau$ is set to a high value, only edges with higher affinity will be
included in the graph. On the other hand, if we set $\tau$ to a low
value, the number of edges increases and more vertices will be assigned
to the same label because they are densely connected by the edge set. We
next discuss each individual potential type in the context of our video
segmentation model.
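The effect of the threshold on the edge set can be illustrated with a short Python sketch. It assumes that an aggregated affinity in [0, 1] is already available for every vertex pair and that an edge is kept when its affinity strictly exceeds the threshold; both the function name `build_edges` and the strict comparison are assumptions made for this illustration.

```python
def build_edges(affinity, tau):
    """Keep only the vertex pairs whose affinity exceeds the threshold.

    affinity: dict {(i, j): value in [0, 1]} of aggregated pairwise
              affinities (how channels are aggregated is not modeled here)
    tau:      granularity threshold
    """
    return {pair: a for pair, a in affinity.items() if a > tau}


toy_affinity = {(0, 1): 0.9, (1, 2): 0.4, (0, 2): 0.1}
edges_fine = build_edges(toy_affinity, tau=0.5)     # sparse graph
edges_coarse = build_edges(toy_affinity, tau=0.05)  # dense graph
edges_none = build_edges(toy_affinity, tau=1.0)     # no edge survives
```

A high threshold leaves a sparse graph and therefore a fine segmentation, while a low threshold keeps more edges and pushes more vertices toward a common label.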

Object Boundary Potentials $\psi^b$. Object boundary potentials
$\psi^b_{ij}$ evaluate the cost of assigning two vertices $i$ and $j$ in
the same frame to different labels in terms of object boundary
information. The potential parameters are defined as follows:

$$\psi^b_{ij} = \exp(-d_{\mathrm{MMPW}}(i, j) / \gamma_b) \tag{10}$$

where $d_{\mathrm{MMPW}}(i, j)$ represents the minimum boundary path
weight among all possible paths from vertex $i$ to $j$. The potentials
are obtained from a Gaussian Radial Basis Function (RBF) of
$d_{\mathrm{MMPW}}(i, j)$ with $\gamma_b$, the mean of
$d_{\mathrm{MMPW}}(i, j)$, as a normalization term.

If the two superpixels $i$ and $j$ are adjacent, their object boundary
potential is determined by the shared object contour strength
$b(e_{ij})$, where $e_{ij}$ is the edge that connects vertices $i$ and
$j$ and the boundary strength is estimated by the contour detector [1].
The boundary potentials can be extended to non-adjacent vertices $i$ and
$j$ by evaluating a path weight from vertex $i$ to $j$. For each path
$p$ from vertex $i$ to $j$, the boundary potential of path $p$ is
evaluated by taking the maximum edge weight $b(e_{uv})$, where $e_{uv}$
is an edge along the path $p$. The algorithm to calculate
$d_{\mathrm{MMPW}}(i, j)$ is described in Algorithm 1, which modifies
the Floyd-Warshall shortest path algorithm.

Algorithm 1 Minimum Max-edge Path Weight
procedure MMPW(V, E)
    d ← ∞
    for v ∈ V do
        d[v][v] ← 0
    for (u, v) ∈ E do
        d[u][v] ← b(e_uv)    ▷ assign boundary score
    for k ∈ V do
        for i ∈ V do
            for j ∈ V do
                if d[i][j] > max(d[i][k], d[k][j]) then
                    d[i][j] ← max(d[i][k], d[k][j])
    return d_MMPW ← d
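A runnable counterpart of Algorithm 1 can be sketched in a few lines of Python. The version below assumes vertices are numbered from 0 and that boundary strengths are given as a dictionary over undirected adjacency edges; the function name `mmpw` and the symmetric initialization are illustrative choices rather than details from the paper.

```python
import math


def mmpw(num_vertices, boundary):
    """Minimum max-edge path weight between all pairs of vertices.

    boundary: dict {(u, v): strength} for adjacent superpixels u and v,
              treated as undirected edges
    Returns a matrix d where d[i][j] is the smallest achievable value of
    the strongest boundary crossed on any path from i to j.
    """
    d = [[math.inf] * num_vertices for _ in range(num_vertices)]
    for v in range(num_vertices):
        d[v][v] = 0.0
    for (u, v), strength in boundary.items():
        d[u][v] = min(d[u][v], strength)
        d[v][u] = min(d[v][u], strength)
    # Floyd-Warshall with (min, max) in place of (min, +)
    for k in range(num_vertices):
        for i in range(num_vertices):
            for j in range(num_vertices):
                via_k = max(d[i][k], d[k][j])
                if d[i][j] > via_k:
                    d[i][j] = via_k
    return d


# Four superpixels: 0 and 3 share a strong contour (0.9) but are also
# linked through 1 and 2 by weak contours only.
toy_boundary = {(0, 1): 0.1, (1, 2): 0.2, (2, 3): 0.1, (0, 3): 0.9}
d_toy = mmpw(4, toy_boundary)
```

In the toy graph the direct edge between vertices 0 and 3 crosses a strong contour, yet the detour through weak contours yields a path weight of 0.2, so the two vertices are treated as lying inside the same object.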

Typically, a path in a graph is evaluated by the sum of edge weights
along the path. However, for the boundary strength between two
non-adjacent vertices in the graph, the total sum of the edge weights
along the path is not an effective measure because the sum of weights is
biased toward the number of edges in the path. For example, a path
consisting of many edges of weak contour strength may have a higher path
weight than another path which consists of a smaller number of edges
with strong contours. Therefore, we evaluate a path by the maximum edge
weight along the path, so that the path weight is governed by the edge
with the strongest contour strength.

Figure 2 illustrates two different path weight models: the max edge
weight and the sum edge weight. Figure 2(a) illustrates contour
strength, where red represents high strength. Two vertices indicated by
white arrows are selected on an airplane. In Figure 2(b), two paths are
displayed. Path 2 consists of fewer edges but it intersects a strong
contour that represents the boundary of the airplane. If we evaluate the
object boundary score between the two vertices, Path 1 should be
considered since it connects the vertices within the airplane.
Figure 2(c) shows the edge sum path weight from a vertex at the tail to
all the other vertices. It shows that the minimum path weight between
the two vertices is evaluated on Path 2. On the other hand, Figure 2(d)
illustrates that the max edge path weight takes Path 1 as the minimum
path weight, which agrees with human perception of the object hierarchy.

Color Potentials $\psi^c$. The color feature of each vertex is
represented by a histogram over the CIELab color space in the
corresponding superpixel. The color potential $\psi^c_{ij}$ between
vertices $i$ and $j$ is evaluated on the two color histograms $h^c_i$
and $h^c_j$:

$$\psi^c_{ij} = \exp(-d_{\mathrm{EMD}}(h^c_i, h^c_j) / \gamma_c) \tag{11}$$

where $d_{\mathrm{EMD}}(h^c_i, h^c_j)$ is the Earth Mover's Distance
(EMD) between $h^c_i$ and $h^c_j$ of vertices $i$ and $j$, and
$\gamma_c$ is the normalization parameter.

Earth Mover's Distance is a distance measure between two probability
distributions. EMD is typically more accurate than the $\chi^2$ distance
in the color space of superpixels. An issue with the $\chi^2$ distance
is that if two histograms on the simplex do not share non-zero color
bins, they are evaluated with the maximum distance of 1. Therefore, the
$\chi^2$ distance between vertices $i$ and $j$ is the same as the
distance between $i$ and $k$ if $i$, $j$ and $k$ do not share any color
bins. This occurs often when we compare the color features of
superpixels, because a superpixel is intended to exhibit coherent color,
especially at the fine-grained level. Superpixels on different objects
or on different parts of an object may have different colors. For
example, if we use the $\chi^2$ distance to measure the color difference
of superpixels, the distance between red and orange superpixels will be
the same as the distance between red and blue superpixels because they
do not share color bins. However, this does not match human perception.
In contrast, EMD considers the distance between color bins, hence it is
able to distinguish non-overlapping color histograms.
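The red, orange and blue example can be checked numerically. The following Python sketch compares a bin-wise distance ($\chi^2$ is used here as an example) with EMD on one-dimensional histograms. It is a simplification under stated assumptions: the histograms have equal total mass, bins lie on a line with unit ground distance between neighbors (so EMD reduces to the accumulated difference of cumulative sums), and the five-bin color axis is made up for illustration. The paper itself uses CIELab histograms.

```python
def chi_square(p, q):
    """Chi-square distance between two normalized histograms (max is 1)."""
    return 0.5 * sum((a - b) ** 2 / (a + b)
                     for a, b in zip(p, q) if a + b > 0)


def emd_1d(p, q):
    """EMD between two 1-D histograms of equal mass, unit bin spacing."""
    cdf_gap = 0.0  # running difference of the cumulative histograms
    total = 0.0
    for a, b in zip(p, q):
        cdf_gap += a - b
        total += abs(cdf_gap)
    return total


# Hypothetical 5-bin hue axis: bin 0 is red, bin 1 orange, bin 4 blue.
red = [1.0, 0.0, 0.0, 0.0, 0.0]
orange = [0.0, 1.0, 0.0, 0.0, 0.0]
blue = [0.0, 0.0, 0.0, 0.0, 1.0]

chi_red_orange = chi_square(red, orange)
chi_red_blue = chi_square(red, blue)
emd_red_orange = emd_1d(red, orange)
emd_red_blue = emd_1d(red, blue)
```

Both $\chi^2$ values saturate at 1 because the histograms share no bins, whereas EMD ranks orange as closer to red than blue is.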

Optical Flow Direction Potentials $\psi^o$. In each video frame, the
motion direction feature of the $i$th vertex can be obtained from a
histogram of optical flow direction $h^o_i$. As with the color
potentials, we use EMD between the two histograms $h^o_i$ and $h^o_j$ to
accurately estimate the difference in motion direction:

$$\psi^o_{ij} = \exp(-d_{\mathrm{EMD}}(h^o_i, h^o_j) / \gamma_o) \tag{12}$$

where $\gamma_o$ is the mean EMD distance on optical flow histograms.

Texture Potentials $\psi^x$. Dense SIFT features are extracted for each
superpixel, and a Bag-of-Words (BoW) model is obtained from K-means
clustering on the D-SIFT features. We evaluate SIFT features on multiple
dictionaries of different K. Texture potentials $\psi^x$ are calculated
from an RBF on the $\chi^2$ distance of two BoW histograms $h^x_i$ and
$h^x_j$, which is a typical choice of distance measure for BoW models:

$$\psi^x_{ij} = \exp(-d_{\chi^2}(h^x_i, h^x_j) / \gamma_x) \tag{13}$$

where the parameter $\gamma_x$ is the mean $\chi^2$ distance on D-SIFT
word histograms.
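The potentials in (10) to (13) share one pattern: a pairwise distance is passed through an exponential kernel whose scale is the mean distance of that channel. A minimal Python sketch of this pattern follows; the function name `rbf_potentials`, the dictionary input and the handling of an all-zero distance set are assumptions made for illustration.

```python
import math


def rbf_potentials(distances):
    """Map pairwise distances to affinities exp(-d / gamma).

    distances: dict {(i, j): non-negative distance} for one feature channel
    gamma is set to the mean distance, as the normalization term.
    """
    if not distances:
        return {}
    gamma = sum(distances.values()) / len(distances)
    if gamma == 0:
        # All pairs are identical in this channel: maximal affinity.
        return {pair: 1.0 for pair in distances}
    return {pair: math.exp(-d / gamma) for pair, d in distances.items()}


toy_distances = {(0, 1): 1.0, (0, 2): 3.0}
toy_potentials = rbf_potentials(toy_distances)
```

Normalizing by the channel mean puts the channels on a comparable scale, which is consistent with weighting all channels equally.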

3.4. Temporal Edge Potentials 

Temporal edge potentials define the correspondence of vertices in
different frames. They rely on long trajectories, which convey
long-range temporal dependencies and are more robust than optical flow.

Figure 2: Comparison of two types of path weight models. (a) Contour
strength. (b) Two contour paths. (c) Edge sum path weight. (d) Max edge
path weight.


$$\psi^c_{ij} = \exp(-d_{\mathrm{EMD}}(h^c_i, h^c_j) / \gamma_c) \tag{17}$$

Here $T_i$ is the set of long trajectories which pass through vertex
$i$. The pairwise potential $\psi^t_{ij}$ represents the temporal
correspondence of two vertices through the overlap ratio of the long
trajectories that vertices $i$ and $j$ share, where
$i \in \mathcal{V}^f$, $j \in \mathcal{V}^{f'}$ and $f \neq f'$. In
order to distinguish two different objects with the same motion, we
integrate the color potential $\psi^c_{ij}$ between the two vertices.

3.5. Hierarchical Inference on Segmentation Labels

The proposed model attains hierarchical inference of segmentation labels
by controlling the number of edges with a fixed set of vertices defined
at the finest level of superpixels. As the edge set becomes dense in the
graph, the energy function in (1) takes higher penalties from the
pairwise potentials. As a consequence, vertices connected by dense edges
will be assigned to the same label, which leads to coarse-grained
segmentation.

In contrast, another approach that enables hierarchical segmentation is
to define a hierarchical vertex set in a graph. A set of vertices at the
finer level is connected to a vertex at the coarser level. This
introduces another set of edges which connect vertices at different
levels of the hierarchy.

Our approach to hierarchical inference has computational advantages over
a graph representation with a hierarchical vertex set. Our graph
representation has fewer vertices and edges because we use a single
finest level of hierarchy without additional vertices for coarser
levels. This not only enables efficient graph inference, but also avoids
the computation time needed to calculate node and edge potentials for
additional vertex and edge sets.

4. Experimental Evaluation

4.1. Dataset

We evaluate the proposed model on the VSB100 video segmentation
benchmark provided by Galasso et al. There are a few additional video
datasets which have pixelwise annotation. The FBMS-59 dataset consists
of 59 video sequences and the SegTrack v2 dataset consists of 14
sequences. However, both datasets annotate only a few major objects,
leaving the whole background area as one label. They are more
appropriate for object tracking or background subtraction tasks. On the
other hand, VSB100 consists of 60 test video sequences of at most 121
frames. For each video, every 20th frame is annotated with pixelwise
segmentation labels by four annotators. The dataset contains the largest
number of video sequences annotated with pixelwise labels, which allows
quantitative analysis. The dataset provides a set of evaluation
measures.

Volume Precision-Recall. The VPR score measures the overlap of the
volume between the segmentation result $S$ of the proposed algorithm and
the ground truths $\{G_i\}_{i=1}^{M}$ annotated by $M$ annotators.
Over-segmentation will have high precision with a low recall score.

Boundary Precision-Recall. The BPR score measures the overlap between
the object boundaries of the segmentation result $S$ and the ground
truth boundaries $\{G_i\}_{i=1}^{M}$. Conversely to VPR,
over-segmentation will have low precision with high recall scores.


4.2. MSP-MRF Setup 


In this section, we present the detailed setup of our Multi-Cue
Structure Preserving Markov Random Field (MSP-MRF) model for the
unconstrained video segmentation problem. As described in Section 3.2,
we take a single threshold on the image contours, so that each frame
contains approximately 100 superpixels. We assume that this granularity
level is fine enough that no superpixel at this level overlays multiple
ground truth regions. The node potential $\phi$ is evaluated for each
superpixel with the temporally smooth labels obtained with agglomerative
clustering [10]. Although we chose the 11th fine-grained level of the
hierarchy, Section 4.4 illustrates that the proposed method shows stable
performance over different label set sizes $|\mathcal{L}|$ for the node
potential. Finally, edge potentials are estimated as described in
Sections 3.3 and 3.4. For color histograms, we used 50 bins for each
CIELab color channel. In addition, 50 bins were used for the horizontal
and vertical motion of optical flow. For the D-SIFT Bag-of-Words model,
we used 5 dictionaries of K = 100, 200, 400, 800, 1000 words. The energy
minimization problem in (1) for MRF inference is optimized using the
FastPD algorithm.

Figure 3: Temporal consistency recovered by MSP-MRF.

4.3. Qualitative Analysis 

Figure 3 illustrates a segmentation result on an airplane video
sequence. MSP-MRF rectifies the temporally inconsistent segmentation
result of [10]. For example, in the fourth column of Figure 3, the red
bounding boxes show labels that MSP-MRF rectified from Grundmann's
result such that labels across frames become spatio-temporally
consistent.

In addition, the control parameter $\tau$ successfully obtains different
granularity levels of segmentation. For MSP-MRF, the number of region
labels decreases as $\tau$ decreases. Figure 4 compares the video
segmentation results of MSP-MRF with Grundmann's by displaying
segmentation boundaries at the same granularity levels, where the two
methods have the same number of segments in the video. MSP-MRF infers
spatially smooth object regions, which illustrates that the proposed
model successfully captures the spatial structure of objects.

4.4. PR Curve on High recall regions 

We specifically consider high recall regions of segmentation since we
are typically interested in videos with relatively few objects. Our
proposed method improves and rectifies the state-of-the-art video
segmentation of greedy agglomerative clustering [10], because we make
use of structural information of object boundary, color, optical flow,
texture and temporal correspondence from long trajectories. Figure 5
shows that the proposed method achieves significant improvement over
state-of-the-art algorithms. MSP-MRF improves both BPR and VPR scores
such that they are close to those of the Oracle, which evaluates
contour-based superpixels on the ground truth. It is worth noting that
the Oracle is the best accuracy that MSP-MRF could possibly achieve
because MSP-MRF also takes its contour-based superpixels from [1].

Figure 4: Comparison of segmentation boundaries at the same granularity
levels on two videos.

Figure 5: PR curve comparison to other models.

The proposed MSP-MRF model rectifies agglomerative clustering by merging
two different labels of vertices if doing so reduces the overall cost
defined in (1). Increasing the number of edges in the graph by lowering
the threshold value leads to coarser-grained segmentation. As a result,
MSP-MRF only covers the higher recall regions relative to the
precision-recall scores of the selected label set size $|\mathcal{L}|$
from [10]. A hybrid model that covers high precision regions is
described in Section 4.5.

                                     BPR                 VPR           Length         NCL
Algorithm                       ODS   OSS   AP     ODS   OSS   AP      μ(δ)           μ
Human                           0.81  0.81  0.67   0.83  0.83  0.70    83.24(40.04)   11.90
Ochs and Brox [17]              0.17  0.17  0.06   0.25  0.25  0.12    87.85(38.83)   3.73
Spectral Clustering             0.51  0.56  0.45   0.45  0.51  0.42    80.17(37.56)   8.00
Segmentation Propagation [13]   0.61  0.65  0.59   0.59  0.62  0.56    25.50(36.48)   258.05
gQ = g^SCM                      0.62  0.66  0.54   0.55  0.59  0.55    61.25(40.87)   80.00
                                0.61  0.64  0.51   0.58  0.61  0.58    60.48(43.19)   50.00
Grundmann et al. [10]           0.57  0.62  0.48   0.61  0.65  0.61    51.83(39.91)   117.90
MSP-MRF                         0.63  0.67  0.57   0.65  0.67  0.64    35.76(38.72)   168.93
Oracle                          0.62  0.68  0.61   0.65  0.67  0.68    -              118.56

Table 1: Performance of the MSP-MRF model compared with state-of-the-art
video segmentation algorithms on VSB100.

Figure 6: PR curve on different sizes of label set $\mathcal{L}$.
Panels: Boundary Global PR Curve, Volume Global PR Curve.

Figure 6 illustrates the PR curve of MSP-MRF on different granularity
levels of the label set size $|\mathcal{L}|$ in the node potential. The
dashed green line is the result of greedy agglomerative clustering [10].
The solid green line is the result of MSP-MRF with the edge threshold
$\tau$ set to 1, which leaves no edge in the graph. The figure shows
that the results of MSP-MRF are stable over different sizes of
$|\mathcal{L}|$, particularly in the high recall regions.

4.5. Hybrid Model for Over Segmentation 

The proposed model effectively merges the labels of each pair of nodes
according to the edge set $\mathcal{E}$. As the number of edges
increases, the size of the inferred label set decreases from
$|\mathcal{L}|$, which covers higher recall regions. Although we are
interested in high recall regions, the model also needs to be evaluated
on the high precision regions of the PR curve. For this purpose, we take
a hybrid model that obtains rectified segmentation results from MSP-MRF
on the high recall regions but retains the segmentation result of [10]
on the high precision regions as an unrectified baseline.

Table 1 shows a performance comparison with state-of-the-art video
segmentation algorithms. The proposed MSP-MRF model outperforms
state-of-the-art algorithms on most of the evaluation metrics. BPR and
VPR are described in Section 4.1. Optimal dataset scale (ODS) aggregates
F-scores on a single fixed scale of the PR curve across all video
sequences, while optimal segmentation scale (OSS) selects the best
F-score with a different scale for each video sequence. All evaluation
metrics follow those of the dataset. It is worth noting that our MSP-MRF
model achieves the best ODS and OSS results among the compared
algorithms for both BPR and VPR, on par with those of the Oracle. As
described in Section 4.4, the Oracle is a model that evaluates
contour-based superpixels on the ground truth.
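The difference between ODS and OSS can be made concrete with a small Python sketch. It is a simplified illustration, not the benchmark code: it assumes a precision-recall pair is available per video and per scale, and it averages F-scores over videos, whereas the official benchmark may aggregate precision and recall differently. The function names `f_score` and `ods_oss` and the toy numbers are hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def ods_oss(pr):
    """Compute simplified ODS and OSS scores.

    pr[v][s] = (precision, recall) of video v at segmentation scale s;
    every video is assumed to be evaluated at the same set of scales.
    """
    f = [[f_score(p, r) for p, r in video] for video in pr]
    num_videos = len(f)
    num_scales = len(f[0])
    # ODS: one scale shared by all videos, chosen to maximize the mean F.
    ods = max(sum(video[s] for video in f) / num_videos
              for s in range(num_scales))
    # OSS: each video picks its own best scale.
    oss = sum(max(video) for video in f) / num_videos
    return ods, oss


# Two videos evaluated at two scales (made-up numbers).
toy_pr = [[(1.0, 0.5), (0.5, 0.5)],
          [(0.2, 0.2), (1.0, 1.0)]]
toy_ods, toy_oss = ods_oss(toy_pr)
```

Because OSS is free to choose a scale per video, it can never be lower than ODS under this definition, which matches the ordering of the ODS and OSS columns in Table 1.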

MSP-MRF infers segmentation labels by integrating object boundary,
global structure and temporal smoothness on top of [10]. The results
show that incorporating boundary and global structure rectifies [10] by
a significant margin. It should be noted that the result of [10] is
higher than previously reported in [13]. We assume this is due to
implementation updates to [10] over recent years. Qualitatively, we
observe that the recent implementation of [10] detects objects whose
appearance is less distinctive from the background, whereas the previous
implementation could not separate objects under those circumstances.

5. Conclusion 

In this paper, we have presented a novel video segmentation model that
considers three important aspects of video segmentation. The model
preserves object boundaries by defining the vertex set from
contour-based superpixels. In addition, temporally smooth labels are
inferred by providing unary node potentials from the agglomerative
clustering label likelihood. Finally, global structure is enforced by
pairwise edge potentials on object boundary, color, optical flow motion,
texture and long trajectory affinities. Experimental evaluation shows
that the proposed model outperforms state-of-the-art video segmentation
algorithms on most of the metrics.